Markov chains are a natural and well understood tool for describing one-dimensional patterns in
time or space. We show how to infer k-th order Markov chains, for arbitrary k, from finite data
by applying Bayesian methods to both parameter estimation and model-order selection. Extending 
existing results for multinomial models of discrete data, we connect inference to statistical mechanics 
through information-theoretic (type theory) techniques. We establish a direct relationship between 
Bayesian evidence and the partition function which allows for straightforward calculation of the 
expectation and variance of the conditional relative entropy and the source entropy rate. Finally, 
we introduce a novel method that uses finite data-size scaling with model-order comparison to infer 
the structure of out-of-class processes. 
I. INTRODUCTION

Statistical inference of models from small data samples 
is a vital tool in the understanding of natural systems. In
many problems of interest data consists of a sequence of 
letters from a finite alphabet. Examples include analysis 
of sequence information in biopolymers [1, 2],
investigation of one-dimensional spin systems [3], models of
natural languages [4], and coarse-grained models of chaotic
dynamics [5, 6]. This diversity of potential application
has resulted in the development of a variety of
representations for describing discrete-valued data series.

We consider the k-th order Markov chain model class
which uses the previous k letters in a sequence to pre- 
dict the next letter. Inference of Markov chains from 
data has a long history in mathematical statistics. Early 
work focused on maximum likelihood methods for estimating
the parameters of the Markov chain [7, 8, 9]. This
work often assumed a given fixed model order. That is, 
no model comparison across orders is done. This work 
also typically relied on the assumed asymptotic normal- 
ity of the likelihood when estimating regions of confi- 
dence and when implementing model comparison. As a 
result, the realm of application has been limited to data 
sources where these conditions are met. One consequence 
of these assumptions has been that data sources which 
exhibit forbidden words, symbol sequences which are not 
allowed, cannot be analyzed with these methods. This 
type of data violates the assumed normality of the like- 
lihood function. 

More recently, model comparison in the maximum likelihood
approach has been extended using various information
criteria. These methods for model-order selection
are based on extensions of the likelihood ratio and allow
the comparison of more than two candidate models at
a time. The most widely used are Akaike's information
criterion (AIC) [10] and the Bayesian information
criterion (BIC) [11]. (Although the latter is called Bayesian, it
does not employ Bayesian model comparison in the ways
we will present here.) In addition to model selection
using information criteria, methods from information
theory and machine learning have also been developed. Two
of the most widely employed are minimum description
length (MDL) [12] and structural risk minimization [13].
Note that MDL and Bayesian methods obtain similar
results in some situations [14]. However, to the best of
our knowledge, structural risk minimization has not been
adapted to Markov chain inference.

We consider Bayesian inference of the Markov chain 
model class, extending previous results BiEm. We 
provide the details necessary to infer a Markov chain of 
arbitrary order, choose the appropriate order (or weight 
orders according to their probability), and estimate the 
data source's entropy rate. The latter is important for 
estimating the intrinsic randomness and achievable com- 
pression rates for an information source [13. The ability 
to weight Markov chain orders according to their probability
is unique to Bayesian methods and unavailable in the
model selection techniques discussed above. 

In much of the literature just cited, steps of the in- 
ference process are divided into (i) point estimation of 
model parameters, (ii) model comparison (hypothesis 
testing), and (iii) estimation of functions of the model 
parameters. Here we will show that Bayesian inference 
connects all of these steps, using a unified set of ideas. 
Parameter estimation is the first step of inference, model 
comparison a second level, and estimation of the entropy 
rate a final step, intimately related to the mathematical 
structure underlying the inference process. This view of 
connecting model to data provides a powerful and unique 
understanding of inference not available in the classical
statistics approach to these problems. As we demon- 
strate, each of these steps is vital and implementation of 
one step without the others does not provide a complete 
analysis of the data-model connection. 

Moreover, the combination of inference of model pa- 
rameters, comparison of performance across model or- 
ders, and estimation of entropy rates provides a powerful 
tool for understanding Markov chain models themselves. 
Remarkably, this is true even when the generating data 
source is outside of the Markov chain model class. Model 
comparison provides a sense of the structure of the data 
source, whereas estimates of the entropy rate provide a 
description of the inherent randomness. Bayesian infer- 
ence, information theory, and tools from statistical me- 
chanics presented here touch on all of these issues within 
a unified framework. 

We develop this as follows, assuming a passing famil- 
iarity with Bayesian methods and statistical mechanics. 
First, we discuss estimation of Markov chain parame- 
ters using Bayesian methods, emphasizing the use of the 
complete marginal posterior density for each parameter, 
rather than point estimates with error bars. Second, we 
consider selection of the appropriate memory k given a 
particular data set, demonstrating that a mixture of or- 
ders may often be more appropriate than selecting a sin- 
gle order. This is certainly a more genuinely Bayesian 
approach. In these first two parts we exploit different 
forms of Bayes' theorem to connect data and model class. 

Third, we consider the mathematical structure of the 
evidence (or marginal likelihood) and draw connections 
to statistical mechanics. In this discussion we present a 
method for estimating entropy rates by taking derivatives 
of a partition function formed from elements of each step 
of the inference procedure. Last, we apply these tools to 
three example information sources of increasing complex- 
ity. The first example belongs to the Markov chain model 
class, but the other two are examples of hidden Markov 
models (HMMs) that fall outside of that class. We show 
that the methods developed here provide a powerful tool 
for understanding data from these sources, even when 
they do not belong to the model class being assumed. 



II. INFERRING MODEL PARAMETERS

In the first level of Bayesian inference we develop a
systematic relation between the data $D$, the chosen model
class $M$, and the vector of model parameters $\theta$. The
object of interest in the inference of model parameters is
the posterior probability density $P(\theta|D,M)$. This is the
probability of the model parameters given the observed
data and chosen model. To find the posterior we first
consider the joint distribution $P(\theta,D|M)$ over the data
and model parameters given that one has chosen to model
the source with a representation in a certain class $M$.
This can be factored in two ways: $P(\theta|D,M)\,P(D|M)$
or $P(D|\theta,M)\,P(\theta|M)$. Setting these equal and solving
for the posterior we obtain Bayes' theorem:

$$P(\theta|D,M) = \frac{P(D|\theta,M)\,P(\theta|M)}{P(D|M)} . \tag{1}$$

The prior $P(\theta|M)$ specifies our assumptions regarding
the model parameters. We take a pragmatic view of the
prior, considering its specification to be a statement of
assumptions about the chosen model class. The likelihood
$P(D|\theta,M)$ describes the probability of the data given
the model parameters. Finally, the evidence (or marginal
likelihood) $P(D|M)$ is the probability of the data given the
model class, with the parameters integrated out.
In the following sections we describe each of the
quantities in detail on our path to giving an explicit expression
for the posterior.

A. Markov chains

The first step in inference is to clearly state the
assumptions that make up the model. This is the foundation
for writing down the likelihood of a data sample and
informs the choice of prior. We assume that a single data
set of length $N$ is the starting point of the inference and
that it consists of symbols $s_t$ from a finite alphabet $\mathcal{A}$,

$$D = s_0 s_1 \ldots s_{N-1} , \quad s_t \in \mathcal{A} . \tag{2}$$

We introduce the notation $\overleftarrow{s}^{k}_{t}$ to indicate a length-$k$
sequence of letters ending at position $t$: e.g., $\overleftarrow{s}^{2}_{4} = s_3 s_4$.

The k-th order Markov chain model class assumes finite
memory and stationarity in the data source. The
finite memory condition, a generalization of the
conventional Markov property, can be written

$$p(D) = p(\overleftarrow{s}^{k}_{k-1}) \prod_{t=k-1}^{N-2} p(s_{t+1}|\overleftarrow{s}^{k}_{t}) , \tag{3}$$

thereby factoring into terms which depend only on
preceding words of length $k$. The stationarity condition can
be expressed

$$p(s_t|\overleftarrow{s}^{k}_{t-1}) = p(s_{t+m}|\overleftarrow{s}^{k}_{t+m-1}) \tag{4}$$

for any $(t, m)$. Equation (4) results in a simplification of the
notation because we no longer need to track the position
index: $p(s_t = s|\overleftarrow{s}^{k}_{t-1} = \overleftarrow{s}^{k}) = p(s|\overleftarrow{s}^{k})$ for any $t$. Given
these two assumptions, the model parameters of the k-th
order Markov chain $M_k$ are

$$\theta_k = \left\{ p(s|\overleftarrow{s}^{k}) : s \in \mathcal{A},\ \overleftarrow{s}^{k} \in \mathcal{A}^{k} \right\} . \tag{5}$$

A normalization constraint is placed on these parameters:
$\sum_{s\in\mathcal{A}} p(s|\overleftarrow{s}^{k}) = 1$ for each word $\overleftarrow{s}^{k}$.

The next step is to write down the elements of Bayes'
theorem specific to the k-th order Markov chain.
B. Likelihood

Given a sample of data $D = s_0 s_1 \ldots s_{N-1}$, the
likelihood can be written down using the Markov property
of Eq. (3) and the stationarity of Eq. (4). This results in
the form

$$P(D|\theta_k,M_k) = \prod_{\overleftarrow{s}^{k}\in\mathcal{A}^{k}} \prod_{s\in\mathcal{A}} p(s|\overleftarrow{s}^{k})^{\,n(\overleftarrow{s}^{k}s)} , \tag{6}$$

where $n(\overleftarrow{s}^{k}s)$ is the number of times the word $\overleftarrow{s}^{k}s$
occurs in the sample $D$. For future use we also introduce
notation for the number of times a word $\overleftarrow{s}^{k}$ has been
observed, $n(\overleftarrow{s}^{k}) = \sum_{s\in\mathcal{A}} n(\overleftarrow{s}^{k}s)$. Note that Eq. (6)
is conditioned on the start sequence $\overleftarrow{s}^{k}_{k-1} = s_0 s_1 \ldots s_{k-1}$.
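As an illustration of Eq. (6), the following minimal sketch tabulates the word counts and evaluates the log-likelihood for a given parameter set. The function names, the string representation of the data, and the sample sequence are assumptions made for this example only.

```python
from collections import Counter
from math import log

def word_counts(data, k):
    """Count n(w s): occurrences of each history w (length k) followed by s."""
    counts = Counter()
    for t in range(k, len(data)):
        counts[(data[t - k:t], data[t])] += 1
    return counts

def log_likelihood(counts, theta):
    """Natural log of Eq. (6), conditioned on the first k symbols.

    theta[(w, s)] holds p(s | w). An observed word with probability
    zero makes the likelihood vanish, so -inf is returned.
    """
    total = 0.0
    for (w, s), n in counts.items():
        p = theta.get((w, s), 0.0)
        if p == 0.0:
            return float("-inf")
        total += n * log(p)
    return total

# Illustrative binary sample that never contains the word "00".
data = "1101110101"
counts = word_counts(data, k=1)
theta = {("1", "1"): 0.5, ("1", "0"): 0.5, ("0", "1"): 1.0, ("0", "0"): 0.0}
ll = log_likelihood(counts, theta)
```

Only the N - k transitions after the start sequence are counted, which matches the conditioning noted for Eq. (6).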

C. Prior 

The prior $P(\theta|M)$ is used to specify assumptions about
the model to be inferred before the data is considered.
Here we use conjugate priors for which the posterior
distribution has the same functional form as the prior. Our
choice allows us to derive exact expressions for many
quantities of interest in inference. This provides a
powerful tool for understanding what information is gained
during inference and, especially, model comparison.

The exact form of the prior is determined by our
assignment of hyperparameters $\alpha(\overleftarrow{s}^{k}s)$, which
balance the strength of the modeling assumptions
encoded in the prior against the weight of the data. For
a k-th order Markov chain, there is one hyperparameter
for each word $\overleftarrow{s}^{k}s$, given the alphabet under
consideration. A useful way to think about the assignment of
values to the hyperparameters is to relate them to fake
counts $\tilde{n}(\overleftarrow{s}^{k}s)$, such that $\alpha(\overleftarrow{s}^{k}s) = \tilde{n}(\overleftarrow{s}^{k}s) + 1$. In this
way, the $\alpha(\overleftarrow{s}^{k}s)$ can be set to reflect knowledge of the
data source and the strength of these prior assumptions
can be properly weighted in relation to the actual data
counts $n(\overleftarrow{s}^{k}s)$.

The conjugate prior for Markov chain inference is a
product of Dirichlet distributions, one for each word $\overleftarrow{s}^{k}$.
It restates the finite-memory assumption from the model
definition:

$$P(\theta_k|M_k) = \prod_{\overleftarrow{s}^{k}\in\mathcal{A}^{k}} \left\{ \frac{\Gamma(\alpha(\overleftarrow{s}^{k}))}{\prod_{s\in\mathcal{A}} \Gamma(\alpha(\overleftarrow{s}^{k}s))}\; \delta\Big(1 - \sum_{s\in\mathcal{A}} p(s|\overleftarrow{s}^{k})\Big) \prod_{s\in\mathcal{A}} p(s|\overleftarrow{s}^{k})^{\,\alpha(\overleftarrow{s}^{k}s)-1} \right\} . \tag{7}$$

(See App. A for relevant properties of Dirichlet
distributions.) The prior's hyperparameters $\{\alpha(\overleftarrow{s}^{k}s)\}$ must
be real and positive. We also introduce the more
compact notation $\alpha(\overleftarrow{s}^{k}) = \sum_{s\in\mathcal{A}} \alpha(\overleftarrow{s}^{k}s)$. The function
$\Gamma(x)$ is the well known Gamma function, with
$\Gamma(x) = (x-1)!$ for positive integer $x$. The
$\delta$-function constrains the model parameters to be
properly normalized: $\sum_{s\in\mathcal{A}} p(s|\overleftarrow{s}^{k}) = 1$ for each $\overleftarrow{s}^{k}$.



Given this functional form, there are at least two ways
to interpret what the prior says about the Markov chain
parameters $\theta_k$. In addition to considering fake counts
$\tilde{n}(\cdot)$, as discussed above, we can consider the range of
fluctuations in the estimated $p(s|\overleftarrow{s}^{k})$. Classical
statistics would dictate describing the fluctuations via a single
value with error bars. This can be accomplished by
finding the average and variance of $p(s|\overleftarrow{s}^{k})$ with respect to
the prior. The result is:

$$\mathrm{E}_{\mathrm{prior}}[p(s|\overleftarrow{s}^{k})] = \frac{\alpha(\overleftarrow{s}^{k}s)}{\alpha(\overleftarrow{s}^{k})} , \tag{8}$$

$$\mathrm{Var}_{\mathrm{prior}}[p(s|\overleftarrow{s}^{k})] = \frac{\alpha(\overleftarrow{s}^{k}s)\,\big(\alpha(\overleftarrow{s}^{k}) - \alpha(\overleftarrow{s}^{k}s)\big)}{\alpha(\overleftarrow{s}^{k})^{2}\,\big(1 + \alpha(\overleftarrow{s}^{k})\big)} . \tag{9}$$

A second method, more in line with traditional
Bayesian estimation, is to consider the marginal
distribution for each model parameter. For a Dirichlet
distribution, the marginal for any one parameter will be a Beta
distribution. With this knowledge, a probability density
can be provided for each Markov chain parameter given
a particular setting for the hyperparameters $\alpha(\overleftarrow{s}^{k}s)$. In
this way, the prior can be assigned and analyzed in
substantial detail.

A common stance in model inference is to assume all
things are a-priori equal. This can be expressed by
assigning $\alpha(\overleftarrow{s}^{k}s) = 1$ for all $\overleftarrow{s}^{k} \in \mathcal{A}^{k}$ and $s \in \mathcal{A}$, adding
no fake counts $\tilde{n}(\overleftarrow{s}^{k}s)$. This assignment results in a
uniform prior distribution over the model parameters and a
prior expectation:

$$\mathrm{E}_{\mathrm{prior}}[p(s|\overleftarrow{s}^{k})] = 1/|\mathcal{A}| . \tag{10}$$



D. Evidence 



Given the likelihood and prior derived above, the
evidence $P(D|M_k)$ might appear to be merely a normalization term
in Bayes' theorem. In fact, the evidence provides the
probability of the data given the model $M_k$ and so plays
a fundamental role in model comparison. Formally, the
definition is

$$P(D|M_k) = \int d\theta_k\, P(D|\theta_k,M_k)\, P(\theta_k|M_k) , \tag{11}$$

where we can see that this term can be interpreted as
an average of the likelihood over the prior distribution.
Applying this to the likelihood in Eq. (6) and the prior
in Eq. (7) produces

$$P(D|M_k) = \prod_{\overleftarrow{s}^{k}\in\mathcal{A}^{k}} \left\{ \frac{\Gamma(\alpha(\overleftarrow{s}^{k}))}{\prod_{s\in\mathcal{A}} \Gamma(\alpha(\overleftarrow{s}^{k}s))}\; \frac{\prod_{s\in\mathcal{A}} \Gamma\big(n(\overleftarrow{s}^{k}s) + \alpha(\overleftarrow{s}^{k}s)\big)}{\Gamma\big(n(\overleftarrow{s}^{k}) + \alpha(\overleftarrow{s}^{k})\big)} \right\} . \tag{12}$$

As we will see, this analytic expression results in the
ability to make useful connections to statistical mechanics
techniques when estimating entropy rates. This is
another benefit of choosing a conjugate prior with known
properties.
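The closed form of Eq. (12) is straightforward to evaluate in log space. The sketch below is one possible implementation, assuming a symmetric prior in which every hyperparameter takes the same value `alpha`; the helper names and the string data format are illustrative choices, not part of the original presentation.

```python
from collections import Counter
from math import lgamma, exp

def word_counts(data, k):
    """Count n(w s) for every history w of length k followed by symbol s."""
    counts = Counter()
    for t in range(k, len(data)):
        counts[(data[t - k:t], data[t])] += 1
    return counts

def log_evidence(data, k, alphabet, alpha=1.0):
    """Natural log of Eq. (12) with alpha(w s) = alpha for all words.

    Histories never seen in the data contribute a factor of exactly one,
    so only observed histories need to be visited.
    """
    counts = word_counts(data, k)
    a_w = alpha * len(alphabet)               # alpha(w) = sum_s alpha(w s)
    total = 0.0
    for w in {w for (w, _) in counts}:
        n_w = sum(counts[(w, s)] for s in alphabet)
        total += lgamma(a_w) - len(alphabet) * lgamma(alpha)
        total += sum(lgamma(counts[(w, s)] + alpha) for s in alphabet)
        total -= lgamma(n_w + a_w)
    return total

# Tiny worked case: D = 0101, k = 1, uniform prior (alpha = 1).
# History 0 is followed by 1 twice, history 1 by 0 once, so the
# evidence is (1/3) * (1/2) = 1/6.
value = exp(log_evidence("0101", 1, "01"))
```

Working with `lgamma` rather than the Gamma function itself avoids overflow for realistic counts.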
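E. Posterior

Using Bayes' theorem, Eq. (1), the results of the three
previous sections can be combined to obtain the posterior
distribution over the parameters of the k-th order Markov
chain. One finds:

$$P(\theta_k|D,M_k) = \prod_{\overleftarrow{s}^{k}\in\mathcal{A}^{k}} \left\{ \frac{\Gamma\big(n(\overleftarrow{s}^{k}) + \alpha(\overleftarrow{s}^{k})\big)}{\prod_{s\in\mathcal{A}} \Gamma\big(n(\overleftarrow{s}^{k}s) + \alpha(\overleftarrow{s}^{k}s)\big)}\; \delta\Big(1 - \sum_{s\in\mathcal{A}} p(s|\overleftarrow{s}^{k})\Big) \prod_{s\in\mathcal{A}} p(s|\overleftarrow{s}^{k})^{\,n(\overleftarrow{s}^{k}s)+\alpha(\overleftarrow{s}^{k}s)-1} \right\} . \tag{13}$$

As noted in selecting the prior, the resulting form is a
Dirichlet distribution with modified parameters. This is
a result of choosing the conjugate prior: cf. the forms of
Eq. (7) and Eq. (13).

From Eq. (13) the estimation of the model parameters
$p(s|\overleftarrow{s}^{k})$ and the uncertainty of these estimates can be
given using the known properties of the Dirichlet
distribution. As with the prior, there are two main ways to
understand what the posterior tells us about the
fluctuations in the estimated Markov chain parameters. The
first uses a point estimate with "error bars". We obtain
these from the mean and variance of the $p(s|\overleftarrow{s}^{k})$ with
respect to the posterior, finding

$$\mathrm{E}_{\mathrm{post}}[p(s|\overleftarrow{s}^{k})] = \frac{n(\overleftarrow{s}^{k}s) + \alpha(\overleftarrow{s}^{k}s)}{n(\overleftarrow{s}^{k}) + \alpha(\overleftarrow{s}^{k})} , \tag{14}$$

$$\mathrm{Var}_{\mathrm{post}}[p(s|\overleftarrow{s}^{k})] = \frac{n(\overleftarrow{s}^{k}s) + \alpha(\overleftarrow{s}^{k}s)}{\big(n(\overleftarrow{s}^{k}) + \alpha(\overleftarrow{s}^{k})\big)^{2}} \times \frac{\big(n(\overleftarrow{s}^{k}) + \alpha(\overleftarrow{s}^{k})\big) - \big(n(\overleftarrow{s}^{k}s) + \alpha(\overleftarrow{s}^{k}s)\big)}{n(\overleftarrow{s}^{k}) + \alpha(\overleftarrow{s}^{k}) + 1} . \tag{15}$$

Equation (14) is the posterior mean estimate (PME) of the model
parameters.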

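A deeper understanding of Eq. (14) is obtained through
a simple factoring:

$$\mathrm{E}_{\mathrm{post}}[p(s|\overleftarrow{s}^{k})] = \frac{1}{n(\overleftarrow{s}^{k}) + \alpha(\overleftarrow{s}^{k})} \left[ n(\overleftarrow{s}^{k})\, \frac{n(\overleftarrow{s}^{k}s)}{n(\overleftarrow{s}^{k})} + \alpha(\overleftarrow{s}^{k})\, \frac{\alpha(\overleftarrow{s}^{k}s)}{\alpha(\overleftarrow{s}^{k})} \right] , \tag{16}$$

where $n(\overleftarrow{s}^{k}s)/n(\overleftarrow{s}^{k})$ is the maximum likelihood
estimate (MLE) of the model parameters and
$\alpha(\overleftarrow{s}^{k}s)/\alpha(\overleftarrow{s}^{k})$ is the prior expectation given in Eq. (8).
In this form, it is apparent that the posterior mean
estimate is a weighted sum of the MLE and prior
expectation. As a result, we can say that the posterior
mean and maximum likelihood estimates converge to
the same value for $n(\overleftarrow{s}^{k}) \gg \alpha(\overleftarrow{s}^{k})$. Only when the
data is scarce, or the prior is set with strong conviction,
does the Bayesian estimate add corrections to the MLE.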
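The weighted-sum reading of the posterior mean can be checked numerically. The short sketch below, with hypothetical function names and made-up counts, evaluates Eq. (14) directly and through the factored form of Eq. (16), and adds the variance of Eq. (15).

```python
def posterior_mean(n_ws, n_w, a_ws, a_w):
    """Eq. (14): posterior mean of p(s | w)."""
    return (n_ws + a_ws) / (n_w + a_w)

def weighted_form(n_ws, n_w, a_ws, a_w):
    """Eq. (16): the same mean as a weighted sum of MLE and prior mean."""
    mle = n_ws / n_w if n_w > 0 else 0.0      # weight n_w is zero when unseen
    prior_mean = a_ws / a_w
    return (n_w * mle + a_w * prior_mean) / (n_w + a_w)

def posterior_variance(n_ws, n_w, a_ws, a_w):
    """Eq. (15): posterior variance of p(s | w)."""
    c_ws, c_w = n_ws + a_ws, n_w + a_w
    return (c_ws / c_w ** 2) * ((c_w - c_ws) / (c_w + 1))

# Scarce data: word never seen after 3 visits to its history, uniform prior
# over a binary alphabet (alpha(w s) = 1, alpha(w) = 2).
small = posterior_mean(0, 3, 1, 2)            # 1/5, while the MLE is 0
# Plentiful data: the prior's influence is negligible.
large = posterior_mean(200, 1000, 1, 2)       # close to the MLE of 0.2
```

With scarce data the estimate is pulled away from the boundary value that the MLE would report, which is the correction described in the text.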

A second method for analyzing the resulting posterior 
density is to consider the marginal density for each pa- 
rameter. As discussed with the prior, the marginal for a 
Dirichlet is a Beta distribution. As a result, we can ei- 
ther provide regions of confidence for each parameter or 
simply inspect the density function. The latter provides 
much more information about the inference being made 
than the point estimation just given. In our examples, to 
follow shortly, we plot the marginal posterior density for 
various parameters of interest to demonstrate the wealth 
of information this method provides. 

Before we move on, we make a final point regarding 
the estimation of inference uncertainty. The form of the 
posterior is not meant to reflect the potential fluctuations 
of the data source. Instead, the width of the distribution 
reflects the possible Markov chain parameters which are 
consistent with the observed data sample. These are distinct
notions and should not be conflated. 



F. Predictive distribution 

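Once we have an inferred model, a common task is
to estimate the probability of a new observation $D^{(\mathrm{new})}$
given the previous data and estimated model. This is
implemented by taking an average of the likelihood of
the new data,

$$P(D^{(\mathrm{new})}|\theta_k,M_k) = \prod_{\overleftarrow{s}^{k}\in\mathcal{A}^{k},\, s\in\mathcal{A}} p(s|\overleftarrow{s}^{k})^{\,m(\overleftarrow{s}^{k}s)} , \tag{17}$$

with respect to the posterior distribution [18]:

$$P(D^{(\mathrm{new})}|D,M_k) = \int d\theta_k\, P(D^{(\mathrm{new})}|\theta_k,M_k)\, P(\theta_k|D,M_k) . \tag{18}$$

We introduce the notation $m(\overleftarrow{s}^{k}s)$ to indicate the
number of times the word $\overleftarrow{s}^{k}s$ occurs in $D^{(\mathrm{new})}$, with
$m(\overleftarrow{s}^{k}) = \sum_{s\in\mathcal{A}} m(\overleftarrow{s}^{k}s)$. This method
has the desirable property, compared to point estimates,
that it takes into account the uncertainty in the model
parameters $\theta_k$ as reflected in the form of the posterior
distribution.

The evaluation of Eq. (18) follows the same path as
the calculation for the evidence and produces a similar
form; we find:

$$P(D^{(\mathrm{new})}|D,M_k) = \prod_{\overleftarrow{s}^{k}\in\mathcal{A}^{k}} \left\{ \frac{\Gamma\big(n(\overleftarrow{s}^{k}) + \alpha(\overleftarrow{s}^{k})\big)}{\prod_{s\in\mathcal{A}} \Gamma\big(n(\overleftarrow{s}^{k}s) + \alpha(\overleftarrow{s}^{k}s)\big)}\; \frac{\prod_{s\in\mathcal{A}} \Gamma\big(n(\overleftarrow{s}^{k}s) + m(\overleftarrow{s}^{k}s) + \alpha(\overleftarrow{s}^{k}s)\big)}{\Gamma\big(n(\overleftarrow{s}^{k}) + m(\overleftarrow{s}^{k}) + \alpha(\overleftarrow{s}^{k})\big)} \right\} . \tag{19}$$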
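A minimal sketch of Eq. (19) follows, again assuming a symmetric prior with a single value `alpha`. The dictionaries `n` and `m`, keyed by (history, symbol), and the example counts are illustrative assumptions. For a single new symbol the predictive probability reduces to the posterior mean of Eq. (14), which gives a convenient check.

```python
from math import lgamma, exp

def log_predictive(n, m, alphabet, alpha=1.0):
    """Natural log of Eq. (19).

    n, m map (history, symbol) to counts in the old and new data.
    Histories absent from the new data contribute a factor of one.
    """
    a_w = alpha * len(alphabet)
    total = 0.0
    for w in {w for (w, _) in m}:
        n_w = sum(n.get((w, s), 0) for s in alphabet)
        m_w = sum(m.get((w, s), 0) for s in alphabet)
        total += lgamma(n_w + a_w) - lgamma(n_w + m_w + a_w)
        for s in alphabet:
            n_ws = n.get((w, s), 0)
            m_ws = m.get((w, s), 0)
            total += lgamma(n_ws + m_ws + alpha) - lgamma(n_ws + alpha)
    return total

# Counts from a first-order analysis of an illustrative binary sample.
n = {("1", "1"): 3, ("1", "0"): 3, ("0", "1"): 3}

# Probability that a 0 follows a 1: (3 + 1) / (6 + 2).
p_10 = exp(log_predictive(n, {("1", "0"): 1}, "01"))
# Probability of the so-far unseen word 00: (0 + 1) / (3 + 2).
p_00 = exp(log_predictive(n, {("0", "0"): 1}, "01"))
```

Note that the unseen word 00 receives a small but nonzero predictive probability under the uniform prior, in contrast to a maximum likelihood point estimate.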
III. MODEL COMPARISON

With the ability to infer a Markov chain of a given
order k, a natural question is how to
choose the correct order given a particular data set.
Bayesian methods have a systematic way to address this
through the use of model comparison.

In many ways, this process is analogous to inferring
model parameters themselves, which we just laid out.
We start by enumerating the set of model orders to be
compared, $\mathcal{M} = \{M_k\}_{k_{\min}}^{k_{\max}}$, where $k_{\min}$ and $k_{\max}$
correspond to the minimum and maximum order to be
inferred, respectively. Although we will not consider an
independent, identically distributed (IID) model ($k = 0$)
here, we do note that this could be included using the
same techniques described below.

We start with the joint probability $P(M_k, D|\mathcal{M})$ of a
particular model $M_k \in \mathcal{M}$ and data sample $D$, factoring
it in two ways following Bayes' theorem. Solving for the
probability of a particular model class we obtain

$$P(M_k|D,\mathcal{M}) = \frac{P(D|M_k,\mathcal{M})\, P(M_k|\mathcal{M})}{P(D|\mathcal{M})} , \tag{20}$$

where the denominator is the sum given by

$$P(D|\mathcal{M}) = \sum_{M'_k\in\mathcal{M}} P(D|M'_k,\mathcal{M})\, P(M'_k|\mathcal{M}) . \tag{21}$$

The probability of a particular model class in the set
under consideration is driven by two components: the
evidence $P(D|M_k,\mathcal{M})$, derived in Eq. (12), and the prior
over model classes $P(M_k|\mathcal{M})$.

Two common priors in model comparison are: (i) all
models are equally likely and (ii) models should be
penalized for the number of free parameters used to fit the
data. In the first instance $P(M_k|\mathcal{M}) = 1/|\mathcal{M}|$ is the
same for all orders k. Consequently, this factor cancels out
because it appears in both the numerator and
denominator. As a result, the probability of models using this
prior becomes

$$P(M_k|D,\mathcal{M}) = \frac{P(D|M_k,\mathcal{M})}{\sum_{M'_k\in\mathcal{M}} P(D|M'_k,\mathcal{M})} . \tag{22}$$

In the second case, a common penalty for the number
of model parameters is

$$P(M_k|\mathcal{M}) = \frac{\exp(-|M_k|)}{\sum_{M'_k\in\mathcal{M}} \exp(-|M'_k|)} , \tag{23}$$

where $|M_k|$ is the number of free parameters in the
model. For a k-th order Markov chain, the number of
free parameters is

$$|M_k| = |\mathcal{A}|^{k}\,(|\mathcal{A}| - 1) , \tag{24}$$

where $|\mathcal{A}|$ is the alphabet size. Thus, model probabilities
under this prior take on the form

$$P(M_k|D,\mathcal{M}) = \frac{P(D|M_k,\mathcal{M})\, \exp(-|M_k|)}{\sum_{M'_k\in\mathcal{M}} P(D|M'_k,\mathcal{M})\, \exp(-|M'_k|)} . \tag{25}$$

We note that the normalization sum in Eq. (23) cancels
because it appears in both the numerator and denominator.

Bayesian model comparison has a natural Occam's
razor in the model comparison process [18]. This means
there is a natural preference for smaller models even when
a uniform prior over model orders is applied. In this light,
a penalty for the number of model parameters can be
seen as a very cautious form of model comparison. The model
probabilities under both priors, Eq. (22) and Eq. (25), will be considered
in the examples to follow.

A note is in order on computational implementation.
In general, the resulting probabilities can be extremely
small, easily resulting in numerical underflow if the
equations are not implemented with care. As mentioned
in [16], computation with extended logarithms can be
used to alleviate these concerns.
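One way to sidestep the underflow mentioned above is to keep every evidence in log form and normalize with the log-sum-exp trick. The sketch below does this for Eq. (22) and Eq. (25); it is a plain log-space implementation rather than the extended-logarithm scheme of the cited reference, and the function names, the symmetric prior `alpha`, and the periodic test string are assumptions for illustration.

```python
from collections import Counter
from math import lgamma, exp, log

def log_evidence(data, k, alphabet, alpha=1.0):
    """Natural log of Eq. (12) with alpha(w s) = alpha for all words."""
    counts = Counter()
    for t in range(k, len(data)):
        counts[(data[t - k:t], data[t])] += 1
    a_w = alpha * len(alphabet)
    total = 0.0
    for w in {w for (w, _) in counts}:
        n_w = sum(counts[(w, s)] for s in alphabet)
        total += lgamma(a_w) - len(alphabet) * lgamma(alpha)
        total += sum(lgamma(counts[(w, s)] + alpha) for s in alphabet)
        total -= lgamma(n_w + a_w)
    return total

def num_params(k, alphabet_size):
    """Eq. (24): free parameters of a k-th order chain."""
    return alphabet_size ** k * (alphabet_size - 1)

def model_posterior(data, orders, alphabet, penalize=False):
    """Eq. (22), or Eq. (25) when penalize=True, computed in log space."""
    scores = {}
    for k in orders:
        score = log_evidence(data, k, alphabet)
        if penalize:
            score -= num_params(k, len(alphabet))
        scores[k] = score
    top = max(scores.values())                 # log-sum-exp normalization
    log_norm = top + log(sum(exp(v - top) for v in scores.values()))
    return {k: exp(v - log_norm) for k, v in scores.items()}

data = "1101110101" * 20                       # illustrative sample
uniform = model_posterior(data, [1, 2, 3], "01")
penalized = model_posterior(data, [1, 2, 3], "01", penalize=True)
```

Because the penalty grows with k, the penalized prior can only shift probability toward the smallest order relative to the uniform prior.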



IV. INFORMATION THEORY, STATISTICAL 
MECHANICS, AND ENTROPY RATES 

An important property of an information source is its 
entropy rate $h_\mu$, which indicates the degree of intrinsic
randomness and controls the achievable compression. A 
first attempt at estimating a source's entropy rate might 
consist of plugging a Markov chain's estimated model
parameters into the known expression [17]. However, this
does not accurately reflect the posterior distribution de- 
rived above. This observation leaves two realistic alter- 
natives. The first option is to sample model parameters 
from the posterior distribution. These samples can then 
be used to calculate a set of entropy rate estimates that 
reflect the underlying posterior distribution. A second 
option, which we take here, is to adapt methods from 
type theory and statistical mechanics previously devel- 
oped for IID models [l^ to Markov chains. To the best 
of our knowledge this is the first time these ideas have 
been extended to inferring Markov chains; although cf. 

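In simple terms, type theory shows that the probability
of an observed sequence can be written in terms of
the Kullback-Leibler (KL) distance and the entropy rate.
When applied to the Markov chain inference problem the
resulting form suggests a connection to statistical
mechanics. For example, we will show that averages of the
KL-distance and entropy rate with respect to the
posterior are found by taking simple derivatives of a partition
function.

The connection between inference and information
theory starts by considering the product of the prior, Eq. (7),
and likelihood, Eq. (6):

$$P(\theta_k|M_k)\, P(D|\theta_k,M_k) = P(D,\theta_k|M_k) . \tag{26}$$

This forms a joint distribution over the observed data
$D$ and model parameters $\theta_k$ given the model order $M_k$.
Denoting the normalization constant from the prior as $\mathcal{Z}$
to save space, this joint distribution is

$$P(D,\theta_k|M_k) = \mathcal{Z} \prod_{\overleftarrow{s}^{k},\, s} p(s|\overleftarrow{s}^{k})^{\,n(\overleftarrow{s}^{k}s)+\alpha(\overleftarrow{s}^{k}s)-1} . \tag{27}$$

This form can be written, without approximation, in
terms of conditional relative entropies $\mathcal{D}[\cdot\|\cdot]$ and entropy
rate $h_\mu[\cdot]$:

$$P(D,\theta_k|M_k) = \mathcal{Z}\; 2^{-\beta_k\left(\mathcal{D}[Q\|P] + h_\mu[Q]\right)} \times 2^{+|\mathcal{A}|^{k+1}\left(\mathcal{D}[U\|P] + h_\mu[U]\right)} , \tag{28}$$

where $\beta_k = \sum_{\overleftarrow{s}^{k},\, s} \big[ n(\overleftarrow{s}^{k}s) + \alpha(\overleftarrow{s}^{k}s) \big]$ and the
distribution of true parameters is $P = \{p(\overleftarrow{s}^{k}), p(s|\overleftarrow{s}^{k})\}$. The
distributions $Q$ and $U$ are given by

$$Q = \left\{ q(\overleftarrow{s}^{k}) = \frac{n(\overleftarrow{s}^{k}) + \alpha(\overleftarrow{s}^{k})}{\beta_k} ,\ \ q(s|\overleftarrow{s}^{k}) = \frac{n(\overleftarrow{s}^{k}s) + \alpha(\overleftarrow{s}^{k}s)}{n(\overleftarrow{s}^{k}) + \alpha(\overleftarrow{s}^{k})} \right\} , \tag{29}$$

$$U = \left\{ u(\overleftarrow{s}^{k}) = \frac{1}{|\mathcal{A}|^{k}} ,\ \ u(s|\overleftarrow{s}^{k}) = \frac{1}{|\mathcal{A}|} \right\} , \tag{30}$$

where $Q$ is the distribution defined by the posterior
mean and $U$ is a uniform distribution. The
information-theoretic quantities used above are given by

$$\mathcal{D}[Q\|P] = \sum_{\overleftarrow{s}^{k},\, s} q(\overleftarrow{s}^{k})\, q(s|\overleftarrow{s}^{k}) \log_2 \frac{q(s|\overleftarrow{s}^{k})}{p(s|\overleftarrow{s}^{k})} , \tag{31}$$

$$h_\mu[Q] = - \sum_{\overleftarrow{s}^{k},\, s} q(\overleftarrow{s}^{k})\, q(s|\overleftarrow{s}^{k}) \log_2 q(s|\overleftarrow{s}^{k}) . \tag{32}$$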

The form of Eq. (28) and its relation to the evidence
suggests a connection to statistical mechanics: the
evidence $P(D|M_k) = \int d\theta_k\, P(D,\theta_k|M_k)$ is a partition
function $Z = P(D|M_k)$. Using conventional techniques, the
expectation and variance of the "energy"

$$E(Q,P) = \mathcal{D}[Q\|P] + h_\mu[Q] \tag{33}$$

are obtained by taking derivatives of the logarithm of the
partition function with respect to $\beta_k$:

$$\mathrm{E}_{\mathrm{post}}[E(Q,P)] = -\frac{1}{\log 2}\, \frac{\partial}{\partial\beta_k} \log Z , \tag{34}$$

$$\mathrm{Var}_{\mathrm{post}}[E(Q,P)] = \frac{1}{(\log 2)^{2}}\, \frac{\partial^{2}}{\partial\beta_k^{2}} \log Z . \tag{35}$$

The factors of $\log 2$ in the above expressions come from
the decision to use base 2 logarithms in the definition
of our information-theoretic quantities. This results in
values in bits rather than nats [17].

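To evaluate the above expressions, we take advantage
of the known form for the evidence provided in Eq. (12).
With the definitions $\alpha_k = \sum_{\overleftarrow{s}^{k}} \alpha(\overleftarrow{s}^{k})$ and

$$r(\overleftarrow{s}^{k}) = \frac{\alpha(\overleftarrow{s}^{k})}{\alpha_k} , \qquad r(s|\overleftarrow{s}^{k}) = \frac{\alpha(\overleftarrow{s}^{k}s)}{\alpha(\overleftarrow{s}^{k})} , \tag{36}$$

the negative logarithm of the partition function can be
written

$$-\log Z = \sum_{\overleftarrow{s}^{k},\, s} \log\Gamma\big[\alpha_k\, r(\overleftarrow{s}^{k})\, r(s|\overleftarrow{s}^{k})\big] - \sum_{\overleftarrow{s}^{k}} \log\Gamma\big[\alpha_k\, r(\overleftarrow{s}^{k})\big] + \sum_{\overleftarrow{s}^{k}} \log\Gamma\big[\beta_k\, q(\overleftarrow{s}^{k})\big] - \sum_{\overleftarrow{s}^{k},\, s} \log\Gamma\big[\beta_k\, q(\overleftarrow{s}^{k})\, q(s|\overleftarrow{s}^{k})\big] . \tag{37}$$

From this expression, the desired expectation is found
by taking derivatives with respect to $\beta_k$; we find that

$$\mathrm{E}_{\mathrm{post}}[E(Q,P)] = \frac{1}{\log 2} \sum_{\overleftarrow{s}^{k}} q(\overleftarrow{s}^{k})\, \psi^{(0)}\big[\beta_k\, q(\overleftarrow{s}^{k})\big] - \frac{1}{\log 2} \sum_{\overleftarrow{s}^{k},\, s} q(\overleftarrow{s}^{k})\, q(s|\overleftarrow{s}^{k})\, \psi^{(0)}\big[\beta_k\, q(\overleftarrow{s}^{k})\, q(s|\overleftarrow{s}^{k})\big] . \tag{38}$$

The variance is obtained by taking a second derivative
with respect to $\beta_k$, producing

$$\mathrm{Var}_{\mathrm{post}}[E(Q,P)] = -\frac{1}{(\log 2)^{2}} \sum_{\overleftarrow{s}^{k}} q(\overleftarrow{s}^{k})^{2}\, \psi^{(1)}\big[\beta_k\, q(\overleftarrow{s}^{k})\big] + \frac{1}{(\log 2)^{2}} \sum_{\overleftarrow{s}^{k},\, s} q(\overleftarrow{s}^{k})^{2}\, q(s|\overleftarrow{s}^{k})^{2}\, \psi^{(1)}\big[\beta_k\, q(\overleftarrow{s}^{k})\, q(s|\overleftarrow{s}^{k})\big] . \tag{39}$$

In both of the above the polygamma function is defined as
$\psi^{(n)}(x) = d^{\,n+1}/dx^{\,n+1} \log\Gamma(x)$. (For further details,
consult a reference such as [21].)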
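To make Eq. (38) concrete, the sketch below evaluates the posterior expectation of the energy from a table of word counts. Python's standard library has no digamma function, so a small helper based on the recurrence psi(x) = psi(x + 1) - 1/x and the standard asymptotic series is included; that helper, the symmetric prior `alpha`, the function names, and the use of idealized golden-mean counts (equal counts for the words 11, 10, and 01, none for 00) are all assumptions of this example.

```python
from itertools import product
from math import log

def digamma(x):
    """psi^(0)(x) for x > 0 via upward recurrence and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + log(x) - 0.5 * inv - series

def _table(counts, k, alphabet, alpha):
    """Rows of (n(w) + alpha(w), [n(w s) + alpha(w s)]) over all words w."""
    rows = []
    for w in map("".join, product(alphabet, repeat=k)):
        c_ws = [counts.get((w, s), 0.0) + alpha for s in alphabet]
        rows.append((sum(c_ws), c_ws))
    return rows

def expected_energy(counts, k, alphabet, alpha=1.0):
    """Eq. (38) in bits: posterior mean of D[Q||P] + h_mu[Q]."""
    rows = _table(counts, k, alphabet, alpha)
    beta = sum(c_w for c_w, _ in rows)
    total = 0.0
    for c_w, c_ws in rows:
        total += c_w * digamma(c_w)
        total -= sum(c * digamma(c) for c in c_ws)
    return total / (beta * log(2))

def plug_in_rate(counts, k, alphabet, alpha=1.0):
    """h_mu[Q] of Eq. (32), in bits, from the posterior-mean distribution."""
    rows = _table(counts, k, alphabet, alpha)
    beta = sum(c_w for c_w, _ in rows)
    return -sum(c * log(c / c_w) for c_w, c_ws in rows for c in c_ws) / (beta * log(2))

def golden_mean_counts(N):
    """Idealized first-order counts: (N - 1)/3 each for 11, 10 and 01."""
    n = (N - 1) / 3.0
    return {("1", "1"): n, ("1", "0"): n, ("0", "1"): n}

e_small = expected_energy(golden_mean_counts(31), 1, "01")
e_large = expected_energy(golden_mean_counts(30001), 1, "01")
```

For these counts the estimate moves toward 2/3 bits per symbol as the sample grows, and it never falls below the plug-in value, in line with the non-negativity of the relative entropy term.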

From the form of Eq. (38) and Eq. (39), the meaning
is not immediately clear. We can use an expansion of the
$n = 0$ polygamma function,

$$\psi^{(0)}(x) = \log x - \frac{1}{2x} + O(x^{-2}) , \tag{40}$$

valid for $x \gg 1$, however, to obtain an asymptotic form
for Eq. (38); we find

$$\mathrm{E}_{\mathrm{post}}[E(Q,P)] = H\big[q(\overleftarrow{s}^{k})\, q(s|\overleftarrow{s}^{k})\big] - H\big[q(\overleftarrow{s}^{k})\big] + \frac{1}{2\beta_k}\, |\mathcal{A}|^{k}\, (|\mathcal{A}| - 1) + O(1/\beta_k^{2}) . \tag{41}$$

From this we see that the first two terms make up the
entropy rate $h_\mu[Q] = H[q(\overleftarrow{s}^{k})\, q(s|\overleftarrow{s}^{k})] - H[q(\overleftarrow{s}^{k})]$ and
the last term is associated with the conditional relative
entropy between the posterior mean distribution $Q$ and
true distribution $P$.

In summary, we have found the average of conditional
relative entropy and entropy rate with respect to the
posterior density. This was accomplished by making
connections to statistical mechanics through type theory. Unlike
sampling from the posterior to estimate the entropy rate,
this method results in an analytic form which approaches
$h_\mu[P]$ at a rate inversely proportional to the data size. This method for
approximating $h_\mu$ also provides a computational benefit.
No eigenstates have to be found from the Markov
transition matrix, allowing for the storage of values in sparse
data structures. This provides a distinct computational
advantage when large orders or alphabets are considered.

Finally, it might seem awkward to use the
expectation of Eq. (33) for estimation of the entropy rate. This
method was chosen because it is the form that naturally
appears in writing down the likelihood-prior combination
in Eq. (28). As a result of using this method, most of the
results obtained above are without approximation. We
were also able to show this expectation converges to the
desired value in a well behaved manner.

V. EXAMPLES 

To explore how the above produces a robust inference 
procedure, let's now consider the statistical inference of 
a series of increasingly complex data sources. The first, 
called the golden mean process, is a first-order Markov 
chain. The second data source is called the even process 
and cannot be represented by a Markov chain with fi- 
nite order. However, this source is a deterministic HMM, 
meaning that the current state and next output symbol 
uniquely determine the next state. Finally, we consider 
the simple nondeterministic source, so named since its 
smallest representation is as a nondeterministic HMM. 
(Nondeterminism here refers to the HMM structure: the 
current state and next output symbol do not uniquely 
determine the next state. This source is represented by
an infinite-state deterministic HMM [22, 23].)

The golden mean, even, and simple nondeterministic 
processes can all be written down as models with two in- 
ternal states — call them A and B. However, the complex- 
ity of the data generated from each source is of markedly 
different character. Our goal in this section is to con- 
sider the three main steps in inference to analyze them. 
First, we consider inference of a first-order Markov chain 
to demonstrate the estimation of model parameters with 
uncertainty. Second, we consider model comparison for a 
range of orders k. This allows us to discover structure in 
the data source even though the true model class cannot 
be captured in all cases. Finally, we consider estimation 
of entropy rates from these data sources, investigating 
how randomness is expressed in them. 

While investigating these processes we consider
average data counts, rather than sample counts from specific
realizations, as we want to focus specifically on the
average performance of Bayesian inference. To do this we
take advantage of the known form of the sources. Each
is described by a transition matrix $T$, which gives
transitions between states A and B:

$$T = \begin{bmatrix} p(A|A) & p(B|A) \\ p(A|B) & p(B|B) \end{bmatrix} . \tag{42}$$

Although two of our data sources are not finite Markov
chains, the transition matrix between internal states is
Markov. This means the matrix is stochastic (all rows
sum to one) and we are guaranteed an eigenstate $\pi$
with eigenvalue one: $\pi T = \pi$. This eigenstate
describes the asymptotic distribution over internal states:
$\pi = [p(A), p(B)]$.

The transition matrix can be divided into labeled
matrices $T^{(s)}$ which contain those elements of $T$ that output
symbol $s$. For our binary data sources one has

$$T = T^{(0)} + T^{(1)} . \tag{43}$$

Using these matrices, the average probability of words
can be estimated for each process of interest. For
example, the probability of word 01 can be found using

$$p(01) = \pi\, T^{(0)}\, T^{(1)}\, \eta , \tag{44}$$

where $\eta$ is a column vector with all 1's. In this way, for
any data size $N$, we estimate the average count for a
word as

$$n(\overleftarrow{s}^{k}s) = (N - k)\; p(\overleftarrow{s}^{k}s) . \tag{45}$$

Average counts, obtained this way, will be the basis for
all of the examples to follow.

In the estimation of the true entropy rate for the
examples we use the formula

$$h_\mu = - \sum_{v\in\{A,B\}} p(v) \sum_{s\in\mathcal{A}} p(s|v) \log_2 p(s|v) \tag{46}$$

for the golden mean and even processes, where
$p(s|v) = T^{(s)}_{v\,\cdot}$ (the sum over row $v$ of $T^{(s)}$) is the probability of a letter $s$ given the
state $v$ and $p(v)$ is the asymptotic probability of the state
$v$, which can be found as noted above. For the simple
nondeterministic source this closed-form expression
cannot be applied and the entropy rate must be found using
more involved methods; see [22] for further details.



A. Golden mean process: In-class modeling 

The golden mean process can be represented by a sim- 
ple Ist-order Markov chain over a binary alphabet char- 
acterized by a single (shortest) forbidden word = 00. 
The defining labeled transition matrices for this data 
source are given by 



y{0) ^ 



1/2 




7^(1) ^ 



1/2 
1 



(47) 



Figure 1 provides a graphical representation of the
corresponding hidden Markov chain. Inspection reveals a
simple relation between the internal states A and B and
the output symbols 0 and 1. An observation of 0 indicates
a transition to internal state B and a 1 corresponds
to state A, making this process a Markov chain over 0s
and 1s.

For the golden mean the eigenstate is $\vec{\pi} =
[p(A), p(B)] = [2/3, 1/3]$. With this vector and the
labeled transition matrices any desired word count can be
found as discussed above.
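The calculation described in Eqs. (44)-(46) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names (`stationary`, `word_probability`, `average_count`, `entropy_rate`) are our own, the matrices are those of Eq. (47) with states ordered (A, B), and the two-state shortcut used for the stationary distribution is an assumption that would not carry over to larger machines.

```python
from fractions import Fraction as F
from math import log2

# Labeled transition matrices of the golden mean process, Eq. (47).
# T[s][v][w] = Pr(emit s and move to state w | current state v), states (A, B).
T = {
    "0": [[F(0), F(1, 2)], [F(0), F(0)]],
    "1": [[F(1, 2), F(0)], [F(1), F(0)]],
}

def stationary(T):
    """Stationary distribution of the two-state chain T = T^(0) + T^(1)."""
    a = T["0"][0][1] + T["1"][0][1]   # Pr(A -> B)
    b = T["0"][1][0] + T["1"][1][0]   # Pr(B -> A)
    return [b / (a + b), a / (a + b)]

def word_probability(word, T, pi):
    """Eq. (44) generalized: p(word) = pi T^(s1) ... T^(sL) eta."""
    row = list(pi)
    for s in word:
        row = [sum(row[v] * T[s][v][w] for v in range(2)) for w in range(2)]
    return sum(row)                   # multiplying by the all-ones column

def average_count(word, N, T, pi):
    """Eq. (45): a word of length k + 1 appears (N - k) p(word) times on average."""
    k = len(word) - 1
    return (N - k) * word_probability(word, T, pi)

def entropy_rate(T, pi):
    """Eq. (46); only valid for deterministic hidden Markov chains."""
    h = 0.0
    for v in range(2):
        for s in T:
            p = float(sum(T[s][v]))   # p(s|v) is the row-v sum of T^(s)
            if p > 0.0:
                h -= float(pi[v]) * p * log2(p)
    return h

pi = stationary(T)
print("pi      =", pi)
print("p(01)   =", word_probability("01", T, pi))
print("n(01)   =", average_count("01", 101, T, pi))
print("h_mu    =", entropy_rate(T, pi))
```

Exact rational arithmetic is used for the word probabilities so that forbidden words come out as exactly zero rather than as a small floating-point residue.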




FIG. 1: A deterministic hidden Markov chain for the golden 
mean process. Edges are labeled with the output symbol and 
the transition probability: symbol | probability.

1. Estimation of $M_1$ Parameters

To demonstrate the effective inference of the Markov
chain parameters for the golden mean process we consider
average counts for a variety of data sizes $N$. For each
size, the marginal posterior for the parameters $p(0|1)$ and
$p(1|0)$ is plotted in Fig. 2. The results demonstrate that
the shape of the posterior effectively describes the
distribution of possible model parameters at each $N$ and
converges to the correct values of $p(0|1) = 1/2$ and $p(1|0) = 1$
with increasing data.
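As a rough numerical companion to this convergence, the sketch below computes the posterior mean and variance of the two first-order parameters from average counts. It assumes a Dirichlet prior with every hyperparameter equal to one (an assumption made here for illustration; the prior used for the figures is not restated in this section), so that the posterior hyperparameters are simply the prior values plus the counts, and it uses the Dirichlet moment formulas of Eqs. (A2) and (A3). The helper names are hypothetical.

```python
def posterior_summary(n_s, n_other, prior=1.0):
    """Posterior mean and variance of p(s|context) for a binary alphabet.

    n_s     : count of the word context+s
    n_other : count of the word context+(other symbol)
    prior   : Dirichlet hyperparameter assigned to each word (assumed 1 here)
    """
    a_s = prior + n_s
    a_other = prior + n_other
    a = a_s + a_other
    mean = a_s / a                                  # Eq. (A2)
    var = a_s * a_other / (a * a * (1.0 + a))       # Eq. (A3), with alpha - alpha_s = a_other
    return mean, var

def golden_mean_counts(N):
    """Average length-2 word counts (N - 1) p(word) for the golden mean process."""
    p = {"00": 0.0, "01": 1 / 3, "10": 1 / 3, "11": 1 / 3}
    return {w: (N - 1) * pw for w, pw in p.items()}

summary = {}
for N in (50, 100, 200, 400):
    n = golden_mean_counts(N)
    summary[N] = {
        "p(0|1)": posterior_summary(n["10"], n["11"]),
        "p(1|0)": posterior_summary(n["01"], n["00"]),
    }
    print(N, summary[N])
```

Under these assumptions the posterior mean of $p(0|1)$ sits at 1/2 for every $N$, while the mean of $p(1|0)$ approaches 1 from below, which is the one-sided behavior that a symmetric error bar cannot express.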

Point estimates with a variance can be provided for
each of the parameters, but these numbers by themselves
can be misleading. Nevertheless, estimates obtained
using the mean and variance of the posterior are a more
effective description of the inference process than a
maximum likelihood estimate with estimated error given by a
Gaussian approximation of the likelihood alone. As Fig. 2
demonstrates, in fact, a Gaussian approximation of
uncertainty is an ineffective description of our knowledge
when the Markov chain parameters are near their upper
or lower limits at 0 and 1. Probably the most effective set
of numbers to provide consists of the mean of the posterior
and a region of confidence. These would most accurately
describe asymmetries in the uncertainty of model
parameters. Although we will not do that here, a brief
description of finding regions of confidence is provided
in App. A.

2. Selecting the Model Order k 

Now consider the selection of the appropriate order $k$
from golden mean realizations. As discussed above, the
golden mean process is a first-order Markov chain with
$k = 1$. As a result, we would expect model comparison
to select this order from the available possibilities. To
demonstrate this, we consider orders $k = 1$ to $4$ and
perform model comparison with a uniform prior over orders
(Eq. (22)) and with a penalty for the number of model
parameters (Eq. (25)).

FIG. 2: A plot of the inference of $M_1$ model parameters for
the golden mean process. For each data sample size $N$, the
marginal posterior is plotted for the parameters of interest:
$p(0|1)$ in the top panel and $p(1|0)$ in the lower panel. The true
values of the parameters are $p(0|1) = 1/2$ and $p(1|0) = 1$.
Curves are shown for $N = 50$, $100$, $200$, and $400$.

The results of the model comparisons are given
in Fig. 3. The top panel shows the probability for each
order $k$ as a function of the sample size, using a uniform
prior. For this prior over orders, $M_1$ is selected with any
reasonable amount of data. However, there does seem to
be a possibility of over-fitting for small data sizes $N < 100$.
The bottom panel shows the model probability with a
penalty prior over model order $k$. This removes the
over-fitting at small data sizes and produces an offset which
must be overcome by the data before higher $k$ is selected.
This example is not meant to argue for the penalty prior
over model orders. In fact, Bayesian model comparison
with a uniform prior does an effective job using a
relatively small sample size.
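A comparison of this kind can be sketched as follows. Several ingredients are assumptions on our part, since Eqs. (22) and (25) are not reproduced in this section: the evidence is taken to be the standard Dirichlet-multinomial form with one Dirichlet prior per length-$k$ context and all hyperparameters equal to one, and the penalty prior is taken to be proportional to $\exp(-|M_k|)$ with $|M_k| = 2^k$ free parameters for a binary alphabet. The function names are hypothetical.

```python
from itertools import product
from math import exp, lgamma

# Golden mean labeled matrices, Eq. (47); states ordered (A, B).
T = {"0": [[0.0, 0.5], [0.0, 0.0]],
     "1": [[0.5, 0.0], [1.0, 0.0]]}
PI = [2 / 3, 1 / 3]

def word_probability(word):
    row = PI[:]
    for s in word:
        row = [sum(row[v] * T[s][v][w] for v in range(2)) for w in range(2)]
    return sum(row)

def log_evidence(k, N, alpha=1.0):
    """log P(D | M_k) from average counts (assumed Dirichlet-multinomial form)."""
    total = 0.0
    for letters in product("01", repeat=k):
        context = "".join(letters)
        counts = [(N - k) * word_probability(context + s) for s in "01"]
        total += lgamma(2 * alpha) - 2 * lgamma(alpha)          # prior normalization
        total += sum(lgamma(alpha + n) for n in counts)
        total -= lgamma(2 * alpha + sum(counts))
    return total

def order_posterior(N, orders=(1, 2, 3, 4), penalty=False):
    """Posterior probability of each order under a uniform or penalty prior."""
    logp = {}
    for k in orders:
        n_params = 2 ** k            # |A|^k (|A| - 1) with |A| = 2
        logp[k] = log_evidence(k, N) - (n_params if penalty else 0.0)
    top = max(logp.values())         # subtract the maximum for numerical stability
    weights = {k: exp(v - top) for k, v in logp.items()}
    z = sum(weights.values())
    return {k: w / z for k, w in weights.items()}

for N in (50, 200, 1000):
    print(N, order_posterior(N), order_posterior(N, penalty=True))
```

Because average counts are generally not integers, the Gamma function is evaluated at real arguments through `lgamma`; contexts that never occur (those containing the forbidden word 00) contribute a factor of one to the evidence.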

3. Estimation of Entropy Rate 

We can also demonstrate the convergence of the average
for $E(Q,P) = D[Q\|P] + h_\mu[Q]$ given in Eq. (38) to
the correct entropy rate for the golden mean process. We
choose to show this convergence for all orders $k = 1$ to $4$
discussed in the previous section. This exercise demonstrates
that all orders greater than or equal to $k = 1$
effectively capture the entropy rate. However, the
convergence to the correct values for higher-order $k$ takes
more data because of a larger initial value of $D[Q\|P]$.
This larger value is simply due to the larger number of
parameters for higher-order Markov chains.

FIG. 3: Model comparison for Markov chains of order $k = 1$
to $4$ using average counts from the golden mean process.
Sample sizes from $N = 100$ to $N = 1{,}000$ in steps of $\Delta N = 5$
are used to generate these plots. The top panel displays the
model probabilities using a uniform prior over orders $k$. The
bottom panel displays the effect of a penalty for model size.

In evaluating the value of $D[Q\|P] + h_\mu[Q]$ for different
sample lengths, we expect that the PME estimated
$Q$ will converge to the true distribution $P$. As a result,
the conditional relative entropy should go to zero with
increasing $N$. For the golden mean process, the known
value of the entropy rate is $h_\mu = 2/3$ bits per symbol.
Inspection of Fig. 4 demonstrates the expected convergence
of the average from Eq. (38) to the true entropy
rate.

The result of our model comparison from the previous
section could also be used in the estimation of the entropy
rate. As we saw in Fig. 3, there are ranges of sample
length $N$ where the probabilities of orders $k = 1, 2$ are
both nonzero. In principle, an estimate of $h_\mu$ should
be made by weighting the values obtained for each $k$ by
the corresponding order probability $P(M_k | D, \mathcal{M})$. As
we can see from Fig. 4, the estimates of the entropy rate
for $k = 1, 2$ are also very similar in this range of $N$. As a
result, this additional step would not have a large effect
for entropy rate estimation.
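Written out, the weighting just described would take a form along the following lines. This is our own shorthand rather than an equation from the text: $\hat{h}_k(N)$ is a label introduced here for the order-$k$ posterior average at sample size $N$, and the sum runs over the orders being compared.

```latex
\hat{h}(N) = \sum_{k} P(M_k | D, \mathcal{M}) \, \hat{h}_k(N) ,
\qquad
\hat{h}_k(N) = E_{\rm post}\big[ D[Q\|P] + h_\mu[Q] \big]
\quad \text{evaluated within } M_k .
```

When the $\hat{h}_k(N)$ are nearly equal for every order carrying appreciable probability, the weighted sum differs little from any single term, which is the situation described for $k = 1, 2$.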



B. Even process: Out-of-class modeling 

We now consider a more difficult data source called the
even process. The defining labeled transition matrices are
given by

$$ T^{(0)} = \begin{bmatrix} 1/2 & 0 \\ 0 & 0 \end{bmatrix} , \qquad
   T^{(1)} = \begin{bmatrix} 0 & 1/2 \\ 1 & 0 \end{bmatrix} . \tag{48} $$

As can be seen in Fig. 5, the node-edge structure is
identical to the golden mean process but the output symbols
on the edges have been changed slightly. As a result
of this shuffle, the states A and B can no longer be
associated with a simple sequence of 0's and 1's. Whereas the
golden mean has the irreducible set of forbidden words
$\mathcal{F} = \{00\}$, the even process has a countably infinite set
$\mathcal{F} = \{0 1^{2n+1} 0 : n = 0, 1, 2, \ldots\}$.

FIG. 4: The convergence of $E_{\rm post}[D[Q\|P] + h_\mu[Q]]$ to the
true entropy rate $h_\mu = 2/3$ bits per symbol (indicated by the
gray horizontal line) for the golden mean process. As
demonstrated in Eq. (41), the conditional relative entropy
$D[Q\|P] \to 0$ as $1/N$. This results in the convergence of $h_\mu[Q]$
to the true entropy rate.

FIG. 5: Deterministic hidden Markov chain representation
of the even process. This process cannot be represented
as a finite-order (nonhidden) Markov chain over the output
symbols 0s and 1s. The set of irreducible forbidden words
$\mathcal{F} = \{0 1^{2n+1} 0 : n = 0, 1, 2, \ldots\}$ reflects the fact that the
process generates blocks of 1's, bounded by 0s, that are even in
length, at any length.

In simple terms, the even process produces blocks of
1's which are even in length. This is a much more
complicated type of memory than we saw in the golden mean
process. For the Markov chain model class, where a word
of length $k$ is used to predict the next letter, this would
require an infinite order $k$. It would be necessary to keep
track of all even and odd strings of 1's, irrespective of
the length. As a result, the properties of the even
process mean that a finite-order Markov chain cannot represent
this data source.

This example is then a demonstration of what can be
learned in a case of out-of-class modeling. We are
interested, therefore, in how well Markov chains approximate
the even process. We expect that model comparison will
select larger $k$ as the size of the data sample increases.
Does the model selection tell us anything about the
underlying data source despite the inability to exactly
capture its properties? As we will see, we do obtain
intriguing hints of the true nature of the even process from
model comparison. Finally, can we estimate the entropy
rate of the process with a Markov chain? As we will see,
a high $k$ is needed to do this effectively.



1. Estimation of $M_1$ Parameters

In this section we consider an $M_1$ approximation of
the even process. We expect the resulting model to
accurately capture length-2 word probabilities as $N$ increases.
In this example, we consider the true model to be the best
approximation possible by a $k = 1$ Markov chain. From
the labeled transition matrices given above we can
calculate the appropriate values for $p(0|1)$ and $p(1|0)$ using
the methods described above. Starting from the asymptotic
distribution $\vec{\pi} = [p(A), p(B)] = [2/3, 1/3]$ we obtain
$p(0|1) = p(10)/p(1) = 1/4$ and $p(1|0) = p(01)/p(0) = 1/2$.
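These target values, and the first few forbidden words, can be checked directly from the labeled matrices. The sketch below is illustrative only: it uses the even-process matrices of Eq. (48) with states ordered (A, B), exact rational arithmetic, and helper names (`word_probability`, `conditional`) of our own choosing.

```python
from fractions import Fraction as F

# Even-process labeled matrices, Eq. (48); states ordered (A, B).
T = {"0": [[F(1, 2), F(0)], [F(0), F(0)]],
     "1": [[F(0), F(1, 2)], [F(1), F(0)]]}
PI = [F(2, 3), F(1, 3)]

def word_probability(word):
    """p(word) = pi T^(s1) ... T^(sL) eta, as in Eq. (44)."""
    row = list(PI)
    for s in word:
        row = [sum(row[v] * T[s][v][w] for v in range(2)) for w in range(2)]
    return sum(row)

def conditional(s, context):
    """p(s | context) = p(context s) / p(context)."""
    return word_probability(context + s) / word_probability(context)

p0_given_1 = conditional("0", "1")
p1_given_0 = conditional("1", "0")
print("p(0|1) =", p0_given_1)
print("p(1|0) =", p1_given_0)

# Words of the form 0 1^m 0 are forbidden exactly when m is odd.
candidates = ["010", "0110", "01110", "011110"]
forbidden = [w for w in candidates if word_probability(w) == 0]
print("forbidden:", forbidden)
```

Note that the first forbidden word, 010, has length three, so a first-order chain fitted to length-2 word counts carries no trace of it.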

As we can see from Fig. 6, a first-order Markov chain
can be inferred without difficulty. The values obtained
are exactly as expected. However, these values do not
tell us much about the nature of the data source by
themselves. This points to the important role of model
comparison and entropy rate estimation in understanding
this data.



2. Selecting the Model Order k 

Now consider the selection of Markov chain order
$k = 1$ to $4$ for a range of data sizes $N$. Recall that the even
process cannot be represented by a finite-order Markov
chain over the output symbols 0 and 1. As a consequence,
we expect higher $k$ to be selected with increasing data $N$,
as more data statistically justifies more complex models.
This is what happens, in fact, but the way in which orders
are selected as we increase $N$ provides structural
information we could not obtain from the inference of a
Markov chain of fixed order.

FIG. 6: A plot of the inference of $M_1$ model parameters for the
even process. For a variety of sample sizes $N$, the marginal
posterior for $p(0|1)$ (top panel) and $p(1|0)$ (bottom panel) are
shown. The true values of the parameters are $p(0|1) = 1/4$
and $p(1|0) = 1/2$.

If we consider Fig. 7, an interesting pattern becomes
apparent. Orders with even $k$ are preferred over odd.
In this way model selection is hinting at the underlying
structure of the source. The Markov chain model class
cannot represent the even process in a compact way, but
inference and model comparison combined provide useful
information about the hidden structure of the source.

In this example we also have regions where multiple
orders $k$ are equally probable. The sample size at which
this occurs depends on the prior over orders which is
employed. When this happens, properties estimated from
the Markov chain model class should use a weighted sum
of the various orders. As we will see in the estimation of
entropy rates, this is not as critical. At sample sizes where
the order probabilities are similar, the estimated entropy
rates are also similar.



3. Estimation of Entropy Rate 

Entropy rate estimation for the even process turns
out to be a more difficult task than one might expect.
In Fig. 8 we see that Markov chains of orders 1 to 6 are
unable to effectively capture the true entropy rate. In fact,
experience shows that an order $k = 10$ Markov chain or
higher is needed to get close to the true value of $h_\mu = 2/3$
bits per symbol. Note also that realizations a factor of 20
longer are required compared, say, to the golden
mean example.

FIG. 7: Model comparison for Markov chains of order $k = 1$
to $4$ for average data from the even process. The top panel shows
the model comparison with a uniform prior over the possible
orders $k$. The bottom panel demonstrates model comparison
with a penalty for the number of model parameters. In both
cases the $k = 4$ model is chosen over lower orders as the
amount of data available increases.

As discussed above, a weighted sum of $E_{\rm post}[D[Q\|P] +
h_\mu[Q]]$ could be employed in this example. For the
estimate this is not critical because the different orders
provide roughly the same value at these points. In fact, these
points correspond to where the estimates of $E(Q,P)$
cross in Fig. 8. They are sample sizes where apparent
randomness can be explained by structure and increased
order $k$.
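The slow convergence in $k$ for the even process can be made concrete with a short calculation. The sketch below does not implement the Bayesian estimator; it computes the quantity that an order-$k$ estimate would be expected to approach with unlimited data, namely the conditional block entropy $H(k+1) - H(k)$ of the true word distribution (treating that large-$N$ limit as the relevant comparison is our assumption). The matrices are those of Eq. (48) and the function names are hypothetical.

```python
from itertools import product
from math import log2

# Even-process labeled matrices, Eq. (48); states ordered (A, B).
EVEN = {"0": [[0.5, 0.0], [0.0, 0.0]],
        "1": [[0.0, 0.5], [1.0, 0.0]]}
PI = [2 / 3, 1 / 3]

def word_probability(word):
    row = PI[:]
    for s in word:
        row = [sum(row[v] * EVEN[s][v][w] for v in range(2)) for w in range(2)]
    return sum(row)

def block_entropy(L):
    """Shannon entropy (bits) of the distribution over length-L words."""
    H = 0.0
    for letters in product("01", repeat=L):
        p = word_probability("".join(letters))
        if p > 0.0:
            H -= p * log2(p)
    return H

def markov_entropy_rate(k):
    """Entropy rate of the best order-k Markov approximation: H(k+1) - H(k)."""
    return block_entropy(k + 1) - block_entropy(k)

rates = {k: markov_entropy_rate(k) for k in range(1, 11)}
for k, h in rates.items():
    print(f"k = {k:2d}   h_k = {h:.4f}   excess over 2/3 = {h - 2 / 3:.4f}")
```

In this calculation the excess over 2/3 bits is still above 0.03 at $k = 6$ and falls below 0.01 only around $k = 10$, in line with the orders quoted in the text.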



FIG. 8: The convergence of $E_{\rm post}[D[Q\|P] + h_\mu[Q]]$ to the
true entropy rate $h_\mu = 2/3$ bits per symbol for the even
process. The true value is indicated by the horizontal gray
line. Experience shows that a $k = 10$ Markov chain is needed
to effectively approximate the true value of $h_\mu$.

C. Simple nondeterministic source: Out-of-class modeling

The simple nondeterministic source adds another level
of challenge to inference. As its name suggests, it is
described by a nondeterministic HMM. Considering Fig. 9
we can see that a 1 is produced on every transition
except for the $B \to A$ edge. This means there are many
paths through the internal states that produce the same
observable sequence of 0s and 1s. The defining labeled
transition matrices for this process are given by

$$ T^{(0)} = \begin{bmatrix} 0 & 0 \\ 1/2 & 0 \end{bmatrix} , \qquad
   T^{(1)} = \begin{bmatrix} 1/2 & 1/2 \\ 0 & 1/2 \end{bmatrix} . \tag{49} $$

Using the state-to-state transition matrix $T = T^{(0)} + T^{(1)}$
we find the asymptotic distribution for the hidden
states to be $\vec{\pi} = [p(A), p(B)] = [1/2, 1/2]$. Each of the
hidden states is equally likely; however, a 1 is always
produced from state A, while there is an equal chance of
obtaining a 0 or 1 from state B.

FIG. 9: A hidden Markov chain representation of the simple
nondeterministic process. This example also cannot be
represented as a finite-order Markov chain over outputs 0 and
1. It, however, is more complicated than the two previous
examples: Only the observation of a 0 provides the observer
with information regarding the internal state of the underlying
process; observing a 1 leaves the internal state ambiguous.
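The nondeterminism of this source can be illustrated by counting hidden-state paths and tracking what an observer knows about the internal state. The sketch below is ours, with hypothetical function names; the matrices are those of the simple nondeterministic source described in the text (a 0 only on the $B \to A$ edge, a 1 on every other edge), with the golden mean matrices of Eq. (47) included for contrast.

```python
# States are indexed 0 = A, 1 = B; T[s][v][w] = Pr(emit s, go to w | state v).
SNS = {"0": [[0.0, 0.0], [0.5, 0.0]],
       "1": [[0.5, 0.5], [0.0, 0.5]]}
GOLDEN = {"0": [[0.0, 0.5], [0.0, 0.0]],
          "1": [[0.5, 0.0], [1.0, 0.0]]}

def state_paths(word, T, start):
    """All positive-probability state paths that emit `word` from `start`."""
    paths = [[start]]
    for s in word:
        paths = [p + [w] for p in paths for w in range(2) if T[s][p[-1]][w] > 0]
    return paths

def state_belief(word, T, pi):
    """Observer's distribution over the current hidden state after seeing `word`."""
    row = list(pi)
    for s in word:
        row = [sum(row[v] * T[s][v][w] for v in range(2)) for w in range(2)]
    z = sum(row)
    return [x / z for x in row]

print("SNS paths for 111 from A:   ", state_paths("111", SNS, 0))
print("golden paths for 111 from A:", state_paths("111", GOLDEN, 0))
print("belief after 0:", state_belief("0", SNS, [0.5, 0.5]))
print("belief after 1:", state_belief("1", SNS, [0.5, 0.5]))
```

For a deterministic machine such as the golden mean, a start state and a word fix at most one path; for the nondeterministic source the number of paths grows with the length of a run of 1s, and only a 0 collapses the observer's belief onto a single state.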



1. Estimation of $M_1$ Parameters

Using the asymptotic distribution derived above, the
parameters of an inferred first-order Markov chain should
approach $p(0|1) = p(10)/p(1) = 1/3$ and $p(1|0) =
p(01)/p(0) = 1$. As we can see from Fig. 10, the inference
process captures these values very effectively despite the
out-of-class data source.

FIG. 10: Marginal density for $M_1$ model parameters for the
simple nondeterministic process: The curves for each data
size $N$ demonstrate a well behaved convergence to the correct
values: $p(0|1) = 1/3$ and $p(1|0) = 1$.



2. Selecting the Model Order k 

Here we consider the comparison of Markov chain models
of orders $k = 1$ to $4$ when applied to data from the simple
nondeterministic source. As with the even process,
we expect increasing order to be selected as the amount
of available data increases. In Fig. 11 we see that this is
exactly what happens.

Unlike the even process, there is no preference for even
orders. Instead, we observe a systematic increase in order
with larger data sets. We do note that the amount of data
needed to select a higher order does seem to be larger than
for the even process. Here the distribution over words is
more important and more subtle than the support of the
distribution (those words with positive probability).



FIG. 11: Model comparison for Markov chains of order $k = 1$
to $4$ for data from the simple nondeterministic process. The
top panel shows the model comparison with a uniform prior
over the possible orders $k$. The bottom panel demonstrates
model comparison with a penalty for the number of model
parameters. Note the scale on the horizontal axis: it takes
much more data for the model comparison to pick out higher
orders for this process compared to the previous examples.



3. Estimation of Entropy Rate

Estimation of the entropy rate for the simple
nondeterministic source provides an interesting contrast to the
previous examples. As discussed when introducing the
examples, this data source is a nondeterministic HMM
and the entropy rate cannot be directly calculated using
Eq. (46). However, a value of $h_\mu \approx 0.677867$ bits
per symbol has been obtained in [22].

Figure 12 shows the results of entropy-rate estimation
using Markov chains of order $k = 1$ to $6$. These results
demonstrate that the entropy rate can be effectively
estimated with low-order $k$ and relatively small data
samples. This is an interesting result, as we might expect
estimation of the entropy rate to be most difficult in this
example. Instead we find that the even process was a
more difficult test case.



VI. DISCUSSION



The examples presented above provide several interest- 
ing lessons in inference, model comparison, and estimat- 
ing randomness. The combination of these three ideas 
applied to a data source provides information and intu- 
ition about the structure of the underlying system, even 
when modeling out-of-class processes. 

FIG. 12: The convergence of $E_{\rm post}[D[Q\|P] + h_\mu[Q]]$ to the
true entropy rate $h_\mu \approx 0.677867$ bits per symbol for the simple
nondeterministic source. The true value is indicated by the
gray horizontal line.

In the examples of $M_1$ estimates for each of the sources
we see that the Bayesian methods provide a powerful
and consistent description of Markov chain model
parameters. The marginal density accurately describes the
uncertainty associated with these estimates, reflecting
asymmetries which point estimation with error bars cannot
capture. In addition, methods described in App. A
can be used to generate regions of confidence of any type.

Although the estimates obtained for the Markov chain 
model parameters were consistent with the data source 
for words up to length k + 1, they did not capture the 
true nature of the system under consideration. This 
demonstrates that estimation of model parameters with- 
out some kind of model comparison can be very mislead- 
ing. Only with the comparison of different orders did 
some indication of the true properties of the data source 
become clear. Without this step, misguided interpreta- 
tions are easily obtained. 

For the golden mean process, a $k = 1$ Markov chain,
the results of model comparison were predictably
uninteresting. This is a good indication that the correct model
class is being employed. However, with the even process
a much more complicated model comparison was found.
In this case, the selection of even $k$ over odd hinted at the
distinguishing properties of the source. In a similar way,
the results of model comparison for the simple
nondeterministic source selected increasing order with larger $N$.
In both out-of-class modeling examples, the increase in
selected order without end is a good indication that the
data source is not in the Markov chain class. (A parallel
technique is found in hierarchical $\epsilon$-machine
reconstruction [24].) Alternatively, there is an indication that very
high-order dependencies are important in the description
of the process. Either way, this information is important
since it gives an indication to the modeler that a more
complicated dynamic is at work and all results must be
treated with caution.

Finally, we considered the estimation of entropy rates
for the example data sources. In two of the cases, the
golden mean process and the simple nondeterministic
source, short data streams were adequate. This is not
unexpected for the golden mean, but for the simple
nondeterministic source this might be considered surprising.
For the even process, the estimation of the entropy rate
was markedly more difficult. For this data source, the
countably infinite number of forbidden words makes the
support of the word distribution at a given length
important. As a result, a larger amount of data and a
higher-order Markov chain are needed to find a decent estimate
of randomness from that data source. In this way, each
of the steps in Bayesian inference allows one to separate
structure from randomness.



VII. CONCLUSION 

We considered Bayesian inference of $k$-th order Markov
chain models. This included estimating model parameters
for a given $k$, model comparison between orders, and
estimation of randomness in the form of entropy rates.
In most approaches to inference, these three aspects are 
treated as separate, but related endeavors. However, we 
find them to be intimately related. An estimate of model 
parameters without a sense of whether the correct model 
is being used is misguided at best. Model comparison 
provides a window into this problem by comparing various
orders $k$ within the model class. Finally, estimating
randomness in the form of an entropy rate provides more
information about the trade-off between structure and 
randomness. To do this we developed a connection to 
the statistical mechanical partition function, from which 
averages and variances were directly calculable. For the 
even process, structure was perceived as randomness and 
for the simple nondeterministic source randomness was 
easily estimated and structure was more difficult to find. 
These insights, despite the out-of-class data, demonstrate 
the power of combining these three methods into one ef- 
fective tool for investigating structure and randomness in 
finite strings of discrete data. 
APPENDIX A



1. Dirichlet Distribution



We supply a brief overview of the Dirichlet distribution
for completeness. For more information, a reference such
as [23] should be consulted. In simple terms, the Dirichlet
distribution is the multinomial generalization of the
Beta distribution. The probability density function for $q$
elements is given by

$$ {\rm Dir}(\{p_i\}) = \frac{\Gamma(\alpha)}{\prod_{i=0}^{q-1} \Gamma(\alpha_i)}
   \, \delta\Big( 1 - \sum_{i=0}^{q-1} p_i \Big)
   \prod_{i=0}^{q-1} p_i^{\alpha_i - 1} . \tag{A1} $$



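The variates must satisfy $p_i \in [0, 1]$ and $\sum_{i=0}^{q-1} p_i = 1$.
The hyperparameters $\{\alpha_i\}$ of the distribution must be
real and positive, and we use the notation $\alpha = \sum_{i=0}^{q-1} \alpha_i$.
The average, variance, and covariance of the parameters
$p_i$ are given by, respectively,

$$ E[p_j] = \frac{\alpha_j}{\alpha} , \tag{A2} $$

$$ {\rm Var}[p_j] = \frac{\alpha_j (\alpha - \alpha_j)}{\alpha^2 (1 + \alpha)} , \tag{A3} $$

$$ {\rm Cov}[p_j, p_l] = - \frac{\alpha_j \alpha_l}{\alpha^2 (1 + \alpha)} , \quad j \neq l . \tag{A4} $$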



2. Marginal distributions 

An important part of understanding uncertainty in the
inference process is the ability to find regions of
confidence from a marginal density. The marginal is obtained
from the posterior by integrating out the dependence on
all parameters except for the parameter of interest. For
a Dirichlet distribution, the marginal density is known to
be a Beta distribution [23],

$$ {\rm Beta}(p_i) = \frac{\Gamma(\alpha)}{\Gamma(\alpha_i) \Gamma(\alpha - \alpha_i)}
   \, p_i^{\alpha_i - 1} (1 - p_i)^{\alpha - \alpha_i - 1} . \tag{A5} $$



3. Regions of confidence from the marginal density 

From the marginal density provided in Eq. (A5) a
cumulative distribution function can be obtained using the
incomplete Beta integral

$$ \Pr(p_i \leq x) = \int_0^x dp_i \, {\rm Beta}(p_i) . \tag{A6} $$

Using this form, the probability that a Markov chain
parameter will be between $a$ and $b$ can be found using
$\Pr(a \leq p_i \leq b) = \Pr(p_i \leq b) - \Pr(p_i \leq a)$. For a
confidence level $R$, between zero and one, we then want to
find $(a, b)$ such that $R = \Pr(a \leq p_i \leq b)$. The
incomplete Beta integral and its inverse can be found using
computational methods, see [HI for details. 
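One way to carry this out with nothing beyond the Python standard library is sketched below. It is a deliberately simple illustration rather than a recommended numerical method: the cumulative distribution of Eq. (A6) is approximated by midpoint quadrature and inverted by bisection, both hyperparameters are assumed to be at least one so that the density is bounded, and the equal-tailed choice of $(a, b)$ is ours, since the condition $R = \Pr(a \leq p_i \leq b)$ does not by itself fix a unique interval. All function names are hypothetical.

```python
from math import exp, lgamma, log

def beta_pdf(p, a_i, a_rest):
    """Marginal density of Eq. (A5) with a_i = alpha_i, a_rest = alpha - alpha_i."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    log_norm = lgamma(a_i + a_rest) - lgamma(a_i) - lgamma(a_rest)
    return exp(log_norm + (a_i - 1.0) * log(p) + (a_rest - 1.0) * log(1.0 - p))

def beta_cdf(x, a_i, a_rest, steps=4000):
    """Eq. (A6), approximated by midpoint quadrature on [0, x]."""
    h = x / steps
    return h * sum(beta_pdf((j + 0.5) * h, a_i, a_rest) for j in range(steps))

def beta_quantile(prob, a_i, a_rest, tol=1e-6):
    """Invert the cumulative distribution by bisection on [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if beta_cdf(mid, a_i, a_rest) < prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def equal_tailed_region(R, a_i, a_rest):
    """Interval (a, b) leaving probability (1 - R) / 2 in each tail."""
    tail = 0.5 * (1.0 - R)
    return beta_quantile(tail, a_i, a_rest), beta_quantile(1.0 - tail, a_i, a_rest)

# Example: posterior hyperparameters 1 + 33 for each of two outcomes.
low, high = equal_tailed_region(0.95, 34.0, 34.0)
print(f"95% region: [{low:.4f}, {high:.4f}]")
```

For parameters near 0 or 1 the resulting interval is asymmetric about the posterior mean, which is the information a symmetric error bar discards. A library routine for the regularized incomplete Beta function would be faster and more accurate than this quadrature.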